Using Qualitative Hypotheses to Identify Inaccurate Data



Identifying inaccurate data has long been regarded as a significant and difficult problem
in AI. In this paper, we present a new method for identifying inaccurate data on the
basis of qualitative correlations among related data. First, we introduce the definitions of 
related data and qualitative correlations among related data. Then we put forward a new 
concept called support coefficient function (SCF). SCF can be used to extract, represent, 
and calculate qualitative correlations among related data within a dataset. We propose an 
approach to determining dynamic shift intervals of inaccurate data, and an approach to 
calculating possibility of identifying inaccurate data, respectively. Both of the approaches 
are based on SCF. Finally we present an algorithm for identifying inaccurate data by 
using qualitative correlations among related data as confirmatory or disconfirmatory evidence.
We have developed a practical system for interpreting infrared spectra by applying
the method, and have fully tested the system against several hundred real spectra. The 
experimental results show that the method is significantly better than the conventional 
methods used in many similar systems. 

1. Introduction 

In many problems of artificial intelligence, inferences are drawn on the basis of interpretation 
or analysis of measured data. However, when measured data are inaccurate, interpreting 
or analyzing them is very difficult. In diagnosis or signal analysis, for example, the general 
reasoning method is to compare measured data with reference values (Reiter, 1987; Shortliffe 
& Buchanan, 1975). When measured data are not accurate due to noise or other unforeseen 
reasons, the comparison between measured data and reference values cannot lead to any
useful conclusion. A rule like "if there is a strong peak in 3000 cm$^{-1}$ - 3100 cm$^{-1}$ on the
infrared spectrum of an unknown compound, then the unknown compound may contain at
least one benzene-ring" may work in ideal cases. However, the rule cannot work in general
cases. For example, when the spectral data are inaccurate, e.g., the measured peak in 3000
cm$^{-1}$ - 3100 cm$^{-1}$ is not a strong peak but a medium one, or a measured strong peak is
not exactly located in 3000 cm$^{-1}$ - 3100 cm$^{-1}$ but is slightly shifted, the rule may not be
applied.

In practical problems, especially in data-rich problems such as diagnosis and interpretation,
measured data are often inaccurate. One reason is that the measuring methods are
error-prone. For example, a patient's temperature or blood-pressure may be inaccurately
measured or entered, and a witness may inaccurately describe the features of a criminal.
The other reason is that the real data are not noise-free. For example, among the received
signals, there may be some noise mixed up, and what is worse, infrared spectral data (peaks)
themselves may be noisy, i.e., some peaks may be affected by noise or other factors.

Identifying inaccurate data has long been regarded as a significant and difficult problem 
in AI. Many methods have been proposed to deal with the problem. Fuzzy logic provides
a mathematical framework for representation and calculation of inaccurate data (Zadeh,
1978). By fuzzy logic, reference value $x_0$ is associated with a fuzzy interval $\Delta x$. If a
measured data item falls into $[x_0 - \Delta x, x_0 + \Delta x]$, then it can be identified as the reference
value with a corresponding membership degree. Probability theory and possibility theory
are also widely used for handling inaccuracy and uncertainty (Dempster, 1968; Duda, Hart,
& Nilsson, 1976; Pearl, 1987; Shafer, 1976; Shortliffe & Buchanan, 1975). The above
methods are commonly used in AI systems. The way of applying them, however, depends
on the nature of domain problems, and there is as yet no standard, generally accepted
method.

We present a method for identifying inaccurate data on the basis of qualitative correlations
among related data. The method is based on the essential consideration that some
data items within a dataset are qualitatively dependent: a set of data may describe the same
phenomenon, or refer to the same behavior. For example, a patient's temperature, blood
pressure and other symptomatic data reflect the patient's disease, and a couple of peaks on
an infrared spectrum indicate the presence of a partial component. We call the dependency
among data within a dataset qualitative correlations among related data^1. By considering
qualitative correlations among related data, we can obtain confirmatory or disconfirmatory
evidence to identify inaccurate data. In general, related data should be simultaneously
present or absent, so if most of the related data have been completely identified, these data
will enhance the identification of the rest. For example, a benzene-ring can create many
other peaks besides the strong peak in 3000 cm$^{-1}$ - 3100 cm$^{-1}$. All the peaks created by the
benzene-ring are related data which have qualitative correlations. If all the peaks except
that in 3000 cm$^{-1}$ - 3100 cm$^{-1}$ have been completely identified, the benzene-ring is quite
likely to be contained by the unknown compound. Therefore, the inaccurate peak around
3000 cm$^{-1}$ - 3100 cm$^{-1}$ may still be identified. In fact, spectroscopists frequently use the
following knowledge in addition to the rules given at the beginning of this section:

    If there is a strong peak around 3000 cm$^{-1}$ - 3100 cm$^{-1}$, then the spectrum may
    be partially created by benzene-rings; check peaks around 1650 cm$^{-1}$, 1550
    cm$^{-1}$ and 700 cm$^{-1}$ - 900 cm$^{-1}$ to make sure because a benzene-ring may have
    other peaks there at the same time.

The central idea of our method is to find evidence for identifying inaccurate data by 
considering qualitative correlations among related data. The idea is very common in human 
thinking. When all the data except blood pressure of a patient show that the patient 
has a certain disease, we would naturally suspect that the blood pressure of the patient 
was inaccurately entered. Similarly, when all the peaks except one indicate that a partial 
component is present, we would naturally suspect that the unmatched peak was inaccurately 
measured or the peak was affected by noise or something else. If acceptable solutions can
be made by assuming an inaccurate data item to be a reference value based on qualitative
correlations between the data item and its related data, the inaccurate data item may be
compensated and hence identified.

1. Detailed definitions will be given later.

Our contributions include: (1) a method which assumes an inaccurate data item to be 
a certain reference value based on the qualitative correlations between the inaccurate data 
item and all of its related data, (2) an algorithm which crystallizes the method, and (3) a 
practical system which uses the algorithm to interpret infrared spectra. 

The key point is a new concept called support coefficient function (SCF) for extracting, 
representing, and calculating qualitative correlations among related data. When measured 
data are inaccurate, the qualitative correlations among related data can provide evidence 
for confirming or disconfirming the hypothesis that the measured data are the same as the 
reference values. An approach to determining dynamic shift intervals of inaccurate data, 
an approach to calculating possibility of identifying inaccurate data, and an algorithm for 
identifying inaccurate data are proposed on the basis of SCF, respectively. 

The method requires few assumptions in advance, so it can avoid inconsistency in knowledge
and data bases. The method identifies inaccurate data by considering qualitative correlations
among related data, so it is quite effective and efficient, especially in the case
of problems where dependencies among data apparently exist. In general, qualitative correlations
among data can always, more or less, be extracted. In the worst case where
qualitative correlations are not known a priori, the method degenerates to a conventional
fuzzy method^2.

We have developed a practical system for interpreting infrared spectra by using the 
method (Zhao & Nishida, 1994). The primary task of the system is to identify unknown 
compounds by interpreting their infrared spectra. We have fully tested the system against 
several hundred real spectra. The experimental results show that the method is significantly 
better than the traditional methods used in many similar systems. The rate of correctness 
(RC) and the rate of identification (RI) which are two important standards for evaluating 
the solutions of infrared spectrum interpretation are near 74% and 90% respectively, and 
the former is the highest among known systems. 

In the following sections, we first describe the problem of identifying inaccurate data in 
Section 2. In Section 3 we give some definitions including the concept of support coefficient 
function (SCF) and other concepts based on SCF. In Section 4 we introduce our method 
for identifying inaccurate data by considering qualitative correlations among related data. 
Section 5 demonstrates the application of the method to a knowledge-based system for 
infrared spectrum identification, and shows the experimental results of the system. Related 
work is discussed in Section 6. Conclusions are addressed in Section 7. 



2. Problem Description 

In practical problems, measured data can be represented as a finite set:

\[ MD = \{d_1, d_2, \ldots, d_n\}, \]

and reference values can also be represented as a finite set:

\[ RV = \{r_1, r_2, \ldots, r_N\}. \]

2. We refer to the fuzzy methods which use an empirical fuzzy interval for each inaccurate data item as
conventional fuzzy methods.

Suppose interpreting or analyzing measured data is carried out on the basis of so-called
"if-then" rules in which the premises are comparisons between MD and RV like "if $d_i = r_j$
then ..." or "if $(r_i \in MD) \wedge (r_j \in MD)$ then ...". When MD is accurate, the main
operation implied by these premises is usually to find a corresponding reference value from
RV for each data item in MD. However, when MD is inaccurate, the operation becomes
complicated. In this case, it is difficult to determine which reference value an inaccurate
data item corresponds to, e.g., for some measured data no reference value may be simply
identified, while for others more than one may be available.

For example, if received signals are known to be accurate, and an expected signal (reference
value) cannot be found from the signal series (measured data), then we can conclude
that the expected signal does not appear. However, if received signals are inaccurate, and
an expected signal cannot be identified from the signal series, it is hard to decide whether
the expected signal does not appear or appears but looks different due to the inaccuracy.

Most currently known approaches for dealing with inaccurate data such as fuzzy logic 
and probabilistic reasoning are mainly based on quantitative similarity or closeness between 
measured data and reference values. In some cases, however, the identity of qualitative 
features is more effective and reliable than quantitative similarity or closeness. 

Consider signal analysis again. If an inaccurate signal has the same qualitative features 
as the expected one such as the interval of frequency, the signal may still be identified even 
though its quantitative features are slightly different from those of the expected one such 
as strength etc.; conversely, an inaccurate signal may not be identified if it is quantitatively 
similar to an expected signal but does not have the same qualitative features as the expected 
one. 

We discussed the following points in Section 1: (1) some data items within a dataset are
qualitatively dependent (i.e., they are related data), (2) there are qualitative correlations
among related data, and (3) qualitative correlations among related data enable us to confirm
or disconfirm the identity of qualitative features.

Therefore, RV and MD can be, explicitly or implicitly, divided into finite groups on
the basis of qualitative dependencies among data, and the data in each group are related
to each other. For example, RV can be divided into $R_1$, $R_2$, ... and $R_k$:

\[ RV = R_1 \cup R_2 \cup \cdots \cup R_k, \]

where

\[ R_j = \{ r_{j_l} \mid r_{j_l} \in RV,\ 1 \le l \le m \}. \]

The qualitative correlations among related data in $R_j$ include: (1) data in $R_j$ should be
simultaneously present or absent which means that all reference values in $R_j$ should have
corresponding data in MD, (2) the presence of $r_{j_p}$ may enhance the presence of $r_{j_q}$, and
the absence of $r_{j_p}$ may depress the presence of $r_{j_q}$. Considering the qualitative correlations
among related data will lead to evidence for the identification of inaccurate data.

The problem of interpreting/analyzing inaccurate data is to make qualitative hypotheses
for MD, or in other words, to find a subset of RV which corresponds to MD:

\[ IN(MD), \quad (IN(MD) \subset RV). \]

The problem can be briefly represented as the following predicate calculus:

\[ \forall d_i \forall R_j ((d_i @ R_j) \wedge (R_j @ MD) \rightarrow R_j \subset IN(MD)) \]

(see footnote 3), where "$d_i @ R_j$" and "$R_j @ MD$" are two essential qualitative predicates in our method which
represent that $d_i$ possibly (qualitatively) belongs to $R_j$ (i.e., ? $d_i \in R_j$), and $R_j$ possibly
(qualitatively) belongs to MD (i.e., ? $R_j \subset MD$), respectively. Determining "$A @ B$" is
based on qualitative correlations among related data. The work presented in this paper
is mainly concentrated on determining "$d_i @ R_j$" and "$R_j @ MD$", and realizing the above
predicate calculus.

3. Preliminaries 

Before introducing our method, we first put forward and explain several new concepts in 
this section. 

3.1 Qualitative Correlations among Related Data 

Definition 3.1 Related data: If data $d_1, d_2, \ldots,$ and $d_m$ describe a common phenomenon,
or they refer to the same behavior simultaneously, then they can be treated as related data.

For example, a patient's temperature, blood pressure and other symptomatic data are
related data, and all the features for describing a criminal are also related data. The phenomenon
that some data within a dataset are related data is more apparent in engineering.
For instance, there are two types of related data in infrared spectrum interpretation as
shown in Figure 1. First, as far as a single peak is concerned, the frequency (position) $f_i$,
strength (height) $s_i$, and width (shape) $w_i$ of the peak are related data. Second, a partial
component may create numerous peaks at the same time. If we consider all the peaks that
a partial component may create, all of these peaks are related data.

Definition 3.2 Qualitative correlations among related data: If di and dj are two related data 
items, then the presence of di enhances the presence of dj, and the absence of di depresses 
the presence of dj. This kind of effect is called qualitative correlations among related data. 

3. Conflicts (overlaps) in IN(MD) should be eliminated. We will not discuss conflict-resolving in this 
paper, but will concentrate on the method for identifying inaccurate data, i.e., ? di@Rj and ? Rj@MD. 
Interested readers may refer to the paper by Zhao (1994) for specific discussion concerning the problem 
of conflict resolution. 
Figure 1: Example of related data in spectrum interpretation

Consider the above example of spectrum interpretation again. If spectral data are inaccurate
(i.e., some measured peaks look like but are not exactly the same as reference
peaks), considering qualitative correlations among related data may lead to qualitative evidence
for the identification of inaccurate data. For example, suppose the frequency of a
peak is slightly different from the reference value, and both the strength and width of the
peak are the same as the reference values. Then the frequency of the peak may still be
identified since both of its related data support it. Similarly, if peaks at low frequency sections
are inaccurate, considering related peaks at high frequency sections may help identify
these peaks, and vice versa.

3.2 Support Coefficient Function 

Definition 3.3 Support coefficient function (SCF): If there are $m - 1$ data related to $d_i$,
then the support coefficient function of $d_i$ calculates the total effects from the related data
by considering the qualitative correlations between $d_i$ and each of its related data.

Suppose $\pi(d_i, d_j)$ represents the qualitative correlation between $d_i$ and $d_j$, then the
support coefficient function of $d_i$ can be defined as:

[Equation lost in extraction: $SCF_i$ expressed in terms of $\pi(d_i, d_j)$ over the $m - 1$ related data items.]

$SCF_i$ should directly depend on how many and how much related data support $d_i$.
When $SCF_i$ is greater than a certain value given by domain experts, the related data tend
to support $d_i$; otherwise, the related data tend to depress $d_i$.

3.3 Evidence Based on SCF 

In Section 2, we used "$d_i @ R_j$" to express that $d_i$ can be qualitatively identified from $R_j$.
Realizing "$d_i @ R_j$" requires a definition of a shift interval $\Delta$ for $R_j$ such as:

\[ R_j \pm \Delta = \{ (r_{j_l} \pm \Delta) \mid l = 1, 2, \ldots, m \}, \]

and a definition of the possibility of "$d_i \in R_j \pm \Delta$".

The above formula is similar to that in fuzzy logic, but contains completely different
meanings. The primary difference is that the shift intervals are dynamically determined by
$SCF_i$, while in fuzzy logic, the fuzzy intervals are usually provided by domain experts in
advance or calculated with quantitative criteria.

Definition 3.4 Shift interval: Shift interval is a dynamic region for inaccurate data. Given 
a standard fuzzy interval for inaccurate data, the shift interval of di varies around the 
standard fuzzy interval on the basis of SCFi. When SCFi shows that the related data 
support di, the shift interval of di becomes wider than the standard fuzzy interval. On the 
other hand, when SCFi shows that the related data do not support di, the shift interval of 
di becomes narrower than the standard fuzzy interval. 

Definition 3.5 Evidence based on $SCF_i$: $SCF_i$ determines the shift interval of $d_i$, that is,
$SCF_i$ determines how widely $d_i$ is allowed to shift. The wider the shift interval, the more
easily $d_i$ is identified. Therefore, $SCF_i$ provides confirmatory or disconfirmatory evidence
for identifying $d_i$.

4. Making Qualitative Hypotheses for Inaccurate Data 

In this section, we introduce and analyze our method for identifying inaccurate data. We
first discuss the processes of realizing two essential predicates in our method, "$d_i @ R_j$" and
"$R_j @ MD$" respectively. Then, we present an algorithm for making qualitative hypotheses
for inaccurate data (i.e., for realizing the predicate calculus described in Section 2).

4.1 Predicate "di@Rj" 

When $d_i$ is accurate, "$d_i @ R_j$" is equal to "$d_i \in R_j$". If there is a reference value in $R_j$ which
corresponds to $d_i$ (i.e., $r_{j_p} \in R_j$ and $r_{j_p} = d_i$), then $d_i @ R_j = T$. If there is no reference
value corresponding to $d_i$, then $d_i @ R_j = F$. When $d_i$ is inaccurate, however, it is not sure
whether $r_{j_p}$ corresponds to $d_i$. In this case, "$d_i @ R_j$" means that $d_i$ possibly (qualitatively)
belongs to $R_j$, or in other words, $r_{j_p}$ possibly (qualitatively) corresponds to $d_i$. The value
of "$d_i @ R_j$" is not T or F, but the possibility of "$r_{j_p} = d_i$" or "$d_i \in R_j$".

We discussed in Section 2 that in some cases the identity of qualitative features is more
robust and reliable than quantitative similarity or closeness. We have also discussed that
qualitative correlations among related data can lead to evidence for the identity of qualitative
features in diagnosis or interpretation. So if $r_{j_p}$ ($r_{j_p} \in R_j$) is assumed to correspond to
$d_i$, and there are $m - 1$ reference values ($r_{j_1}, r_{j_2}, \ldots, r_{j_{p-1}}, r_{j_{p+1}}, \ldots, r_{j_m}$) related to $r_{j_p}$, then
each of the $m - 1$ reference values should correspond to a certain data item in MD, and the
$m - 1$ data items in MD are also related to each other. Therefore, qualitative correlations
between $d_i$ and its $m - 1$ related data items in MD should be considered.

Our method first determines the possibility of "$r_{j_p} = d_i$" by calculating the similarity
or closeness between $r_{j_p}$ and $d_i$ like conventional fuzzy methods, then considers qualitative
correlations among related data to obtain evidence for updating the possibility. When the
qualitative correlations show that the related data support "$r_{j_p} = d_i$", the possibility of
"$r_{j_p} = d_i$" will increase. When the qualitative correlations show that the related data do
not support "$r_{j_p} = d_i$", the possibility will decrease.

4.1.1 Defining Support Coefficient Function 

Suppose $r_{j_q}$ ($r_{j_q} \in R_j$) corresponds to $d_l$. Because $r_{j_q}$ is related to $r_{j_p}$, $d_l$ is related to $d_i$.
As we have discussed, the qualitative correlation between $d_i$ and $d_l$ means that if $d_l$ exists,
then $d_i$ is enhanced; otherwise, $d_i$ is depressed.

We first define the qualitative correlation between two related data items, $d_i$ and $d_l$, as:

\[
C_i(d_l) =
\begin{cases}
1 & \text{if } d_l \text{ can be found from } MD \text{ which satisfies: } r_{j_q} - \delta_0 \le d_l \le r_{j_q} + \delta_0 \\
0 & \text{if } d_l \text{ cannot be found from } MD \text{ which satisfies: } r_{j_q} - \delta_0 \le d_l \le r_{j_q} + \delta_0
\end{cases}
\]

where $\delta_0$ is a standard fuzzy interval of inaccurate data, and $C_i(d_l)$ expresses the qualitative
correlation between $d_i$ and $d_l$. $C_i(d_l) = 1$ means that $d_i$ is enhanced since its related data
item $d_l$ can be found from the measured dataset, and $C_i(d_l) = 0$ means that $d_i$ is depressed
since its related data item $d_l$ cannot be found from the measured dataset. The definition
of $C_i(d_l)$ is simply based on the consideration that if a data item is identified, then the data
item will support its related data items (i.e., the coexisting data items).
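The indicator above can be sketched in a few lines of Python. This is only an illustration: the function name `correlation`, the peak positions, and the value chosen for the standard fuzzy interval are assumptions, not part of the original system.

```python
from typing import Sequence


def correlation(measured: Sequence[float], r_jq: float, delta_0: float) -> int:
    """Return C_i(d_l): 1 if some measured item lies within delta_0 of r_jq, else 0."""
    return int(any(r_jq - delta_0 <= d <= r_jq + delta_0 for d in measured))


# Hypothetical measured peak positions (cm^-1) and a hypothetical delta_0.
md = [702.0, 1548.0, 3065.0]
print(correlation(md, 1550.0, 5.0))  # related reference value has a match in MD
print(correlation(md, 1650.0, 5.0))  # related reference value has no match in MD
```

The first call finds 1548.0 inside the interval around 1550.0, so the related item supports $d_i$; the second finds nothing near 1650.0, so it does not.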

As there are $m$ reference values in $R_j$, we can define the support coefficient function
$SCF_i$ for $d_i$ based on $C_i(d_l)$ ($l = 1, 2, \ldots, m$, $l \ne i$):

\[ SCF_i = \frac{1 + \sum_{l=1,\, l \ne i}^{m} C_i(d_l)}{m} \]

where $0 < SCF_i \le 1$, and $SCF_i$ expresses the total qualitative correlations between $d_i$
and all of its related data. In other words, $SCF_i$ reflects the support coefficient of $r_{j_p}$
corresponding to $d_i$.

If $m = 1$, then $SCF_i = 1$. When $m > 1$, $SCF_i$ increases linearly with the number of
the related data which may be identified from MD.

4.1.2 Determining Dynamic Shift Interval 

Suppose $\delta_0$ is a standard fuzzy interval of inaccurate data, we define the dynamic shift
interval of $d_i$ based on $SCF_i$ as:

\[ \Delta d_i = \frac{(2m - 1)\,\delta_0}{m} \times SCF_i \]

where $0 < \Delta d_i < 2\delta_0$, and $\Delta d_i$ is in direct proportion to $SCF_i$.

If $m = 1$, then $SCF_i = 1$, and $\Delta d_i = \delta_0$. In other words, when qualitative correlations
among data are not known a priori, $SCF_i = 1$ and $\Delta d_i = \delta_0$. In this case, the method
degenerates to a conventional fuzzy method.

When $m$ is fixed, the more the related data are identified, the greater $SCF_i$ is, therefore
the greater $\Delta d_i$ is. When $SCF_i$ is fixed, $\Delta d_i$ depends on the number of related data.
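As a rough numerical illustration of the two formulas, the following Python sketch computes $SCF_i$ and $\Delta d_i$ for a few hypothetical cases. The function names and the value $\delta_0 = 10$ are assumptions made for the example.

```python
from typing import Sequence


def support_coefficient(correlations: Sequence[int]) -> float:
    """SCF_i = (1 + sum of C_i(d_l)) / m, with m = number of related items + 1."""
    m = len(correlations) + 1
    return (1 + sum(correlations)) / m


def shift_interval(scf: float, m: int, delta_0: float) -> float:
    """Delta d_i = (2m - 1) * delta_0 * SCF_i / m."""
    return (2 * m - 1) * delta_0 * scf / m


delta_0 = 10.0  # hypothetical standard fuzzy interval
for found in ([1, 1, 1], [1, 0, 0], [0, 0, 0], []):
    m = len(found) + 1
    scf = support_coefficient(found)
    print(m, scf, shift_interval(scf, m, delta_0))
```

With three related items all identified, the shift interval widens to 17.5; with none identified it narrows to 4.375; with no related items at all ($m = 1$) it equals $\delta_0$, which is the conventional fuzzy case.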

Table 1 shows the relation among $\Delta d_i$, $m$ and $SCF_i$.

Table 1: Relation among $\Delta d_i$, $m$ and $SCF_i$



We can draw the following properties from the above formulas. 

Property 1: With the same $m$, the more the related data are identified, the greater $SCF_i$
is; otherwise, the smaller $SCF_i$ is.

Property 2: With the same $m$, the greater the $SCF_i$, the greater is $\Delta d_i$. In other words,
the more the related data support $d_i$, the more widely $d_i$ is allowed to shift.

Property 3: With the same $SCF_i$, the greater the $m$, the less $\Delta d_i$ varies along with $m$. In
other words, the greater the number of related data, the less a single related data item can
affect $d_i$.

Property 2 and Property 3 are illustrated in Figure 2.

Figure 2: $\Delta d_i$ versus $m$ with different $SCF_i$

Property 4: $\Delta d_i$ is in linear relation to $SCF_i$. For $m \ge 2$, the slope is equal to, or greater than,
$1.5\,\delta_0$, which means that $\Delta d_i$ heavily depends on $SCF_i$.

Property 5: Along with the increase of $m$, the slope increases very slightly. In other
words, $\Delta d_i$ depends on the number of the related data which support $d_i$, rather than the
total number of related data.
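Properties 4 and 5 follow from the slope of $\Delta d_i$ with respect to $SCF_i$, which is $(2m - 1)\,\delta_0 / m$. The short Python sketch below tabulates this slope in units of $\delta_0$; the function name `slope` and the chosen values of $m$ are illustrative assumptions.

```python
def slope(m: int, delta_0: float = 1.0) -> float:
    """Slope of Delta d_i as a function of SCF_i: (2m - 1) * delta_0 / m."""
    return (2 * m - 1) * delta_0 / m


# The slope starts at 1.5 for m = 2 and approaches 2 as m grows.
for m in (2, 3, 5, 10, 100):
    print(m, round(slope(m), 3))
```

Because the slope equals $(2 - 1/m)\,\delta_0$, it is bounded above by $2\delta_0$, which is why increasing $m$ changes it only slightly.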

Property 4 and Property 5 are illustrated in Figure 3.

Figure 3: $\Delta d_i$ versus $SCF_i$ with different $m$



4.1.3 Calculating Value of Predicate "di@Rj"

The value of "$d_i @ R_j$" is equal to the possibility of "$r_{j_p} = d_i$", which can be calculated by
using the following formula:

\[ \mu_i = 1 - \frac{|d_i - r_{j_p}|}{\Delta d_i} \]

where $\mu_i \le 1$.

At a glance, the representation of $\mu_i$ looks like the membership degree of
$r_{j_p} - \Delta d_i \le d_i \le r_{j_p} + \Delta d_i$ in fuzzy logic. However, the meaning is completely different, for $\Delta d_i$ is
neither provided by domain experts nor determined by quantitative similarity or closeness.
Here $\Delta d_i$ is determined on the basis of qualitative correlations among related data. When
qualitative correlations among related data are not considered, $\Delta d_i$ is $\delta_0$, and the possibility
is $1 - \frac{|d_i - r_{j_p}|}{\delta_0}$. With the consideration of qualitative correlations, the possibility is updated.
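The effect of replacing $\delta_0$ by the dynamic shift interval can be seen in a small Python sketch. The function name `possibility` and all numeric values are hypothetical; the widened interval 17.5 corresponds to $m = 4$, $SCF_i = 1$ and $\delta_0 = 10$.

```python
def possibility(d_i: float, r_jp: float, shift: float) -> float:
    """mu_i = 1 - |d_i - r_jp| / shift; a value <= 0 means d_i is not identified."""
    return 1 - abs(d_i - r_jp) / shift


# Hypothetical measured value, reference value and standard fuzzy interval.
d_i, r_jp, delta_0 = 3112.0, 3100.0, 10.0

conventional = possibility(d_i, r_jp, delta_0)  # fixed interval delta_0
supported = possibility(d_i, r_jp, 17.5)        # interval widened by full support
print(conventional, supported)
```

In this example the measured value lies outside the fixed interval, so the conventional possibility is negative, while the widened interval gives a positive possibility and the item can still be identified.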

Two new properties can be drawn from the above formula for calculating $\mu_i$.

Property 6: With the same $d_i$, the greater the $\Delta d_i$, the greater is $\mu_i$. In other words, the
wider the dynamic shift interval, the greater is the value of "$d_i @ R_j$". Formally, if $\Delta d_i'' >
\Delta d_i' > \Delta d_i$, then $\mu_i'' > \mu_i' > \mu_i$.

Property 7: $SCF_i$ provides qualitative evidence for accepting or rejecting $d_i$ as $r_{j_p}$ since
$\mu_i$ increases with $\Delta d_i$, and $\Delta d_i$ is in direct proportion to $SCF_i$.
Property 7 is illustrated in Figure 4.

Figure 4: Value of "$d_i @ R_j$" versus various $\Delta d_i$



The above process of realizing "di@Rj" and calculating the value of "di@Rj" can be 
expressed by the following procedure. 

Procedure d_i@R_j

    select r_jp from R_j;
    SCF_i = 0;
    if d_i = r_jp {
        SCF_i = 1;
        mu_i = 1;
    }
    else {
        for each r_jl in R_j (l = 1, ..., m, l != p) {
            calculate C_i(d_l);    (see footnote 4)
            SCF_i = SCF_i + C_i(d_l);
        }
        SCF_i = (1 + SCF_i)/m;
        Delta_d_i = delta_0 x SCF_i x (2m - 1)/m;
        mu_i = 1 - |d_i - r_jp| / Delta_d_i;
    }
    if mu_i > 0
        return mu_i;
    else
        return NIL
end procedure

4. d_l stands for the data item in MD which corresponds to r_jl.

When $d_i$ can be identified with a certain possibility (i.e., $\mu_i > 0$), the procedure returns
T (i.e., the value of $\mu_i$); otherwise, the procedure returns F.
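A minimal Python sketch of this procedure is given below, assuming that data items and reference values are plain numbers and that a related reference value counts as found when some measured item lies within $\delta_0$ of it. The function name `d_at_r`, its parameters, and the sample values are illustrative assumptions, not the implementation used in the system.

```python
from typing import Optional, Sequence


def d_at_r(d_i: float, p: int, r_j: Sequence[float],
           md: Sequence[float], delta_0: float) -> Optional[float]:
    """Return mu_i if d_i can be identified as r_j[p], otherwise None (NIL)."""
    m = len(r_j)
    if d_i == r_j[p]:
        return 1.0
    # Sum of C_i(d_l): related reference values with a match in MD within delta_0.
    found = sum(
        any(abs(d - r) <= delta_0 for d in md)
        for l, r in enumerate(r_j) if l != p
    )
    scf = (1 + found) / m                      # support coefficient function
    shift = delta_0 * scf * (2 * m - 1) / m    # dynamic shift interval
    mu = 1 - abs(d_i - r_j[p]) / shift         # possibility of r_j[p] = d_i
    return mu if mu > 0 else None


# Hypothetical reference group and two hypothetical measured datasets.
r_j = [700.0, 1550.0, 1650.0, 3050.0]
print(d_at_r(3062.0, 3, r_j, [702.0, 1548.0, 1651.0, 3062.0], 10.0))
print(d_at_r(3062.0, 3, r_j, [3062.0], 10.0))
```

In the first call all three related reference values are matched, so the shift interval widens to 17.5 and the shifted item is identified; in the second call no related data are present, the interval narrows to 4.375, and the same item is rejected.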

4.2 Predicate "Rj@MD" 

When MD is accurate, "$R_j @ MD$" is equal to "$R_j \subset MD$". If all the $m$ reference values in
$R_j$ can be identified from MD, then $R_j @ MD = T$; otherwise $R_j @ MD = F$. When MD is
inaccurate, however, "$R_j @ MD$" means that $R_j$ is possibly (qualitatively) a subset of MD.
The value of "$R_j @ MD$" is not T or F, but the possibility that all the reference values in
$R_j$ can be identified from MD.

If $\mu_l > 0$ ($l = 1, 2, \ldots, m$), then $R_j$ can be regarded as a subset of MD with a certain
possibility. Let $s_1, s_2, \ldots,$ and $s_m$ be the priorities of the reference values in $R_j$, then the
value of "$R_j @ MD$" can be calculated based on $\mu_1, \mu_2, \ldots,$ and $\mu_m$ by using the following
formula:

\[ R_j @ MD = \frac{\sum_{l=1}^{m} s_l \times \mu_l}{\sum_{l=1}^{m} s_l} \]

Suppose $\mu_i$ has been calculated by using procedure $d_i @ R_j$, then the process of realizing
"$R_j @ MD$" and calculating the value of "$R_j @ MD$" can be expressed by a simple procedure.

Procedure R_j@MD

    P = s_i x mu_i;
    S = s_i;
    for l = 1 to m (l != p) {
        if mu_l > 0 {
            P = P + s_l x mu_l;
            S = S + s_l;
        }
        else {
            P = 0;
            exit;
        }
    }
    if P > 0
        return P/S;
    else
        return NIL
end procedure
When $R_j$ can be identified as a subset of MD with a certain possibility (i.e., $P/S$), the
procedure returns T (i.e., the value of $P/S$); otherwise, the procedure returns F.
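The procedure amounts to a priority-weighted mean of the individual possibilities, abandoned as soon as one reference value cannot be identified. A hedged Python sketch follows; the function name `r_at_md`, the use of `None` for NIL, and the sample priorities are assumptions for illustration.

```python
from typing import Optional, Sequence


def r_at_md(mus: Sequence[Optional[float]],
            priorities: Sequence[float]) -> Optional[float]:
    """Return P/S, the priority-weighted mean of mu_l, or None if any mu_l fails."""
    total = 0.0   # P: sum of s_l * mu_l
    weight = 0.0  # S: sum of s_l
    for mu, s in zip(mus, priorities):
        if mu is None or mu <= 0:
            return None  # one reference value is not identified
        total += s * mu
        weight += s
    return total / weight if total > 0 else None


# Hypothetical possibilities and priorities for a group of three reference values.
print(r_at_md([1.0, 0.8, 0.5], [2.0, 1.0, 1.0]))
print(r_at_md([1.0, None, 0.5], [2.0, 1.0, 1.0]))
```

The first call yields a value close to 0.825; the second returns `None` because one reference value has no positive possibility.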

4.3 Algorithm for Making Qualitative Hypotheses for Inaccurate Data 

We give the following algorithm for interpreting/analyzing measured data based on procedure
d_i@R_j and procedure R_j@MD. When measured data are not accurate, the
algorithm can identify inaccurate data items by considering qualitative correlations among
related data.

Algorithm Making-Qualitative-Hypotheses

    IN(MD) = {};
    for i = 1 to n {
        for j = 1 to k {
            P(R_j) = 0;
            if d_i@R_j (i.e., Procedure d_i@R_j)
                if R_j@MD (i.e., Procedure R_j@MD) {
                    R_j -> IN(MD);
                    P(R_j) = R_j@MD;
                }
                end if
            end if
        }
        end for
    }
    end for
end algorithm

In the algorithm, $P(R_j)$ represents the value of "$R_j @ MD$". The algorithm is actually
the realization of the predicate calculus: $\forall d_i \forall R_j ((d_i @ R_j) \wedge (R_j @ MD) \rightarrow R_j \subset IN(MD))$.

For each measured data item in $\{d_1, d_2, \ldots, d_n\}$, the algorithm searches $\{R_1, R_2, \ldots,
R_k\}$ once. For each $R_j$ ($R_j = \{r_{j_1}, r_{j_2}, \ldots, r_{j_m}\}$), the algorithm checks other $n - 1$ measured
data items for $m$ times, and other $m - 1$ reference values for $n$ times. Therefore, with blind
search, the number of operations is about (at worst): $n \times k \times [m \times (n - 1) + n \times (m - 1)] =
2 \times k \times m \times n^2 - k \times n^2 - k \times m \times n$. Since $k$ and $m$ are two constants, the complexity of
the algorithm is $O(n^2)$.
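The control flow of the algorithm and the operation count can be sketched in Python as follows. The two predicates are passed in as functions; `near` and `all_near` are deliberately crude stand-ins with a fixed tolerance (they do not use SCF), and all names and data are hypothetical.

```python
from typing import Callable, Dict, Optional, Sequence

Group = Sequence[float]


def make_hypotheses(
    md: Sequence[float],
    groups: Dict[str, Group],
    d_at_r: Callable[[float, Group], Optional[float]],
    r_at_md: Callable[[Group, Sequence[float]], Optional[float]],
) -> Dict[str, float]:
    """Return IN(MD) as a mapping from group name to P(R_j)."""
    in_md: Dict[str, float] = {}
    for d_i in md:                        # i = 1 .. n
        for name, r_j in groups.items():  # j = 1 .. k
            if d_at_r(d_i, r_j) is None:  # d_i@R_j fails
                continue
            p = r_at_md(r_j, md)          # R_j@MD
            if p is not None:
                in_md[name] = p           # R_j -> IN(MD), P(R_j) = R_j@MD
    return in_md


def near(d: float, r_j: Group, tol: float = 10.0) -> Optional[float]:
    """Stand-in for d_i@R_j: d is within tol of some reference value."""
    return 1.0 if any(abs(d - r) <= tol for r in r_j) else None


def all_near(r_j: Group, md: Sequence[float], tol: float = 10.0) -> Optional[float]:
    """Stand-in for R_j@MD: every reference value has a measured item within tol."""
    ok = all(any(abs(d - r) <= tol for d in md) for r in r_j)
    return 1.0 if ok else None


def operation_count(n: int, k: int, m: int) -> int:
    """Worst-case count n * k * [m(n - 1) + n(m - 1)] from the text."""
    return n * k * (m * (n - 1) + n * (m - 1))


groups = {"R_1": [700.0, 3050.0], "R_2": [1200.0, 2900.0]}
md = [705.0, 3046.0, 1195.0]
print(make_hypotheses(md, groups, near, all_near))
print(operation_count(10, 3, 4), operation_count(20, 3, 4))
```

Only `R_1` enters the hypothesis set, since one reference value of `R_2` has no counterpart in the measured data. Doubling $n$ from 10 to 20 raises the operation count from 1980 to 8160, roughly a factor of four, in line with the quadratic bound.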

5. Application to Infrared Spectrum Interpretation 

We have developed a knowledge-based system for interpreting infrared spectra by applying 
the proposed method, and have fully tested the system against several hundred real spectra. 
The experimental results show that the proposed method is significantly better than the 
conventional methods used in many similar systems. 
5.1 Infrared Spectrum Interpretation

The primary task of infrared spectrum interpretation is to identify unknown objects by
interpreting their infrared spectra. In this paper, without loss of generality, we limit the
problem to interpreting infrared spectra of compounds in order to determine the composition
of unknown compounds.

We selected infrared spectrum interpretation as the domain of application for the
following reasons:

1. Interpreting infrared spectra is a very significant problem in both academic research
and industrial application. For example, in chemical science and engineering, interpreting
infrared spectra of compounds is the most effective way to identify unknown
compounds, and to analyze the composition and purity of compounds (Colthup, Daly,
& Wiberley, 1990).

2. Interpreting infrared spectra is a very difficult problem. First, spectral data are huge 
in quantity, and complex in representation. Second, both symbolic reasoning and 
numerical analysis are needed to interpret infrared spectral data (Puskar, Levine, & 
Lowry, 1986; Sadtler, 1988). 

3. Interpreting infrared spectra is a typical problem dealing with inaccurate data since
spectral data are often inaccurate. They often shift from their theoretical values due to
various reasons. For example, the following is an assertion for spectrum interpretation:

The high frequency peak of partial component $PC_a$ is located at $F_i$.

In practice, however, the peak of $PC_a$ may irregularly shift around $F_i$ due to noise or
other unforeseen reasons. When the above assertion is used to identify real spectra,
uncertainty arises.

5.2 Applying the Proposed Method to Infrared Spectrum Interpretation 

Interpreting infrared spectra is a special problem of diagnosis. Suppose the infrared spectrum
of an unknown compound can be thresholded and represented as a finite set of peaks
(i.e., the measured dataset MD):

\[ Sp = \{p_1, p_2, \ldots, p_n\}, \]

where every peak consists of the frequency (position) $f$, strength (height) $s$, and width
(shape) $w$, respectively:

\[ p_i = (f_i, s_i, w_i), \quad i = 1, 2, \ldots, n. \]

Because $f_i$, $s_i$ and $w_i$ refer to the same peak $p_i$, they are related data. This is the first
kind of related data in infrared spectrum interpretation.
Suppose there is a finite set of partial components (i.e., reference values RV): 

$PC = \{PC_1, PC_2, \ldots, PC_k\}$ 

$= \{\{p_{j1}, p_{j2}, \ldots, p_{jm}\} \mid j = 1, 2, \ldots, k\}$ 

$= \{\{(f_{jp}, s_{jp}, w_{jp}) \mid p = 1, 2, \ldots, m\} \mid j = 1, 2, \ldots, k\}$. 

Because $f_{jp}$, $s_{jp}$ and $w_{jp}$ also refer to the same reference peak $p_{jp}$, they are the first kind 
of related data as well. 
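
As a minimal sketch of the two datasets above, the measured peaks and the reference partial components can be held in small Python records. The class names `Peak` and `PartialComponent` and all numeric values are illustrative assumptions; they are not taken from the described system, which is implemented in C.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Peak:
    """One peak p = (f, s, w); the three fields are related data."""
    f: float  # frequency (position), in cm^-1
    s: float  # strength (height)
    w: float  # width (shape)


@dataclass(frozen=True)
class PartialComponent:
    """A partial component PC_j together with its reference peaks p_j1 .. p_jm."""
    name: str
    peaks: List[Peak]


# Hypothetical values, chosen only to show the layout of Sp and PC.
Sp = [Peak(3055.0, 0.51, 1.0), Peak(1450.0, 0.60, 1.0)]
PC = [PartialComponent("benzene-ring", [Peak(3055.0, 1.0, 1.0), Peak(1645.0, 0.6, 1.0)])]

print(len(Sp), PC[0].name, len(PC[0].peaks))
```

Keeping the three features of a peak in one record mirrors the first kind of related data, and keeping all reference peaks of a component in one list mirrors the second kind introduced later in this section.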

The spectroscopic knowledge for interpreting infrared spectra is usually expressed as "if 
$p_i$ is equal to $p_{jp}$, then $p_i$ may be created by partial component $PC_j$". Here "$p_i$ is equal to 
$p_{jp}$" means that $f_i$, $s_i$ and $w_i$ are equal to $f_{jp}$, $s_{jp}$ and $w_{jp}$, respectively. 

The first kind of related data has the following qualitative correlations: 

1. $f_i$, $s_i$ and $w_i$ should be identified simultaneously, that is, 

• if $f_i$ is $f_{jp}$, then $s_i$ is $s_{jp}$ and $w_i$ is $w_{jp}$, and 

• if $s_i$ is $s_{jp}$, then $f_i$ is $f_{jp}$ and $w_i$ is $w_{jp}$, and 

• if $w_i$ is $w_{jp}$, then $f_i$ is $f_{jp}$ and $s_i$ is $s_{jp}$. 

2. Related data support each other. For example, if both $f_i$ and $s_i$ have been identified, 
then they will enhance the identification of $w_i$. Conversely, if $f_i$ and $s_i$ have not been 
identified, then they will weaken the identification of $w_i$. 

Our method for identifying $f_i$, $s_i$ and $w_i$ based on the qualitative correlations among 
them can be formalized as the following formulas in predicate calculus, respectively: 

$\forall f_i \forall p_{jp} ((f_i @ p_{jp}) \wedge (p_{jp} @ p_i) \rightarrow p_i \text{ is created by } PC_j)$, and 
$\forall s_i \forall p_{jp} ((s_i @ p_{jp}) \wedge (p_{jp} @ p_i) \rightarrow p_i \text{ is created by } PC_j)$, and 
$\forall w_i \forall p_{jp} ((w_i @ p_{jp}) \wedge (p_{jp} @ p_i) \rightarrow p_i \text{ is created by } PC_j)$, 

where "$p_i$ is created by $PC_j$" means that $f_i$, $s_i$ and $w_i$ can be qualitatively identified to be 
$f_{jp}$, $s_{jp}$ and $w_{jp}$. 

In general, each partial component may create a finite number of peaks at the same time. So if $p_i$ is 
created by $PC_j$, then $Sp$ is partially created by $PC_j$; if $Sp$ is partially created by $PC_j$, then 
all the peaks that $PC_j$ may create should be contained in $Sp$ simultaneously. Therefore, 
all the peaks created by a partial component are also related data. This is the second kind 
of related data in infrared spectrum interpretation. 

The second kind of related data has the following qualitative correlations: 

1. All the peaks of a partial component should be identified simultaneously, that is, 

if $p_i$ is $p_{jp}$, then $p_{jl} \in Sp$ ($l = 1, 2, \ldots, m$, $l \neq p$). 

2. The peaks created by the same partial component support each other. For example, 
if most of the peaks of a partial component have been identified, these peaks will 
enhance the identification of the remaining peaks. Conversely, if most of the peaks of a 
partial component cannot be identified, then the identification of the remaining peaks will 
be weakened. 
Our method for identifying related peaks based on the qualitative correlations can be 
formalized as the following formula in predicate calculus: 

$\forall p_i \forall PC_j ((p_i @ PC_j) \wedge (PC_j @ Sp) \rightarrow PC_j \subset IN(Sp))$. 

5.3 System for Interpreting Infrared Spectra 

Our system is implemented with C and MS-WINDOWS. Figure 5 shows the data flow 
diagram of the system. 



[Diagram labels: spectroscopic knowledge; solution; Sp: Unknown Infrared Spectrum; IN(Sp): Interpretation of Sp] 

Figure 5: Data flow diagram of the system 

The input data of the system are infrared spectra of unknown compounds, and the 
solutions are partial components that the input spectra may contain. Because inferences 
are based on qualitative features of spectral data and qualitative correlations among related 
data, the system can achieve high interpretation accuracy even with noisy spectral data. 

As we mentioned before, there are two types of related data in infrared spectrum 
interpretation: all the features of a single peak (i.e., $f_i$, $s_i$ and $w_i$ of $p_i$), and all the peaks 
of a single partial component (i.e., $p_1$, $p_2$, ... and $p_m$). The inference engine of the system 
applies the proposed method to both types of related data when inaccuracy arises. 

5.4 An Example 

We discuss the performance of the system through the following example. Figure 6 shows 
an infrared spectrum of an unknown compound. The spectrum is very hard to interpret 
since the peak marked with an arrow (named $p_1$) shifts substantially. Our system correctly identifies 
that $p_1$ is created by the partial component benzene-ring. 

In contrast, many similar systems cannot correctly identify the peak (Clerc, Pretsch, 
& Zurcher, 1986; Hasenoehrl, Perkins, & Griffiths, 1992; Wythoff, Buck, & Tomellini, 
1989) since the peak of a benzene-ring at this frequency position (named $p_{b1}$) should be 
a strong peak (i.e., $s_{b1} > 1.000$) according to spectroscopic knowledge, not a medium one 
($s_1 = 0.510$) as is the case in this example. Systems based on conventional fuzzy methods 
usually assume a fuzzy interval for each inaccurate peak, then determine the membership 
degree to which the inaccurate peak is in the fuzzy interval. Suppose the reference value for 
a strong peak is 1.000, and the fuzzy interval for a strong peak is 0.300 (Colthup, Daly, 
& Wiberley, 1990); then only peaks with a strength of $1 \pm 0.300$ can be regarded as strong 
peaks. Obviously, by conventional fuzzy methods, the possibility of $p_1$ being a strong peak 
is zero, i.e., $\mu_{benzene-ring}(s_1) = 0$. 

Inferring on the basis of qualitative correlations among related data, our system makes 
a correct interpretation of the spectrum. Through the following two cases, we introduce the 
inference process of the system, and at the same time demonstrate the use of our method 
for identifying inaccurate data. 



[Plot axes: vertical axis from 0.000 to 1.200; horizontal axis Frequency (cm$^{-1}$) from 4000 to 600] 

Figure 6: An example of infrared spectrum 

5.4.1 Case I: Considering the First Kind of Related Data 

Because the frequency (position) and width (shape) of $p_1$ are both the same as those of 
benzene-ring, the possibility of $f_1$ being identified as $f_{b1}$ is 100% (i.e., $\mu_{benzene-ring}(f_1) = 1$), 
and the possibility of $w_1$ being identified as $w_{b1}$ is also 100% (i.e., $\mu_{benzene-ring}(w_1) = 1$)^5. 

As we have discussed before, $f_1$, $s_1$ and $w_1$ are related data, so we can obtain confirmatory 
evidence for identifying $s_1$ by considering qualitative correlations among $s_1$, $f_1$ and $w_1$: 

$\mu_{benzene-ring}(f_1) = 1$, 

so $c_{s_1}(f_1) = 1$ ($c_{s_1}(f_1)$ represents the qualitative correlation between $s_1$ and $f_1$), 

$\mu_{benzene-ring}(w_1) = 1$, 

so $c_{s_1}(w_1) = 1$ ($c_{s_1}(w_1)$ represents the qualitative correlation between $s_1$ and $w_1$), 
so $SCF_{s_1} = \frac{1 + 1 + 1}{3} = 1$, and 

$\Delta s_1 = \frac{6 - 1}{3} \times 0.300 \times 1 = 0.500$; 

$s_1 @ p_{b1} = 1 - \frac{|0.510 - 1.000|}{0.500} = 0.02$. 

5. $\mu_r(d)$ means the possibility of $d$ being identified by conventional fuzzy methods, i.e., SCF is not 
considered. 
By considering $SCF_{s_1}$, the possibility of $p_1$ being regarded as a strong peak of 
benzene-ring increases from 0 to 0.02. As a possibility, 0.02 may not differ much from 0.04 or 0.06, but 
0.02 is significantly different from 0. Many near-misses can be handled by such a small but 
nonzero possibility. For example, in most systems based on fuzzy and other methods (Clerc, Pretsch, 
& Zurcher, 1986), it is impossible to identify $p_1$ as "strong" (i.e., $\mu_{benzene-ring}(s_1) = 0$), 
but considering qualitative correlations among related data makes it possible, although the 
possibility is only 0.02. 
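
The arithmetic of Case I can be checked with a short Python sketch. The function names are hypothetical, and the dynamic shift interval is assumed to have the form (2m - 1)/m × Δd × SCF with m = 3 related data items, which is our reading of the (6 - 1) factor in the computation above.

```python
def membership_af(d: float, r: float, delta_d: float) -> float:
    """Conventional fuzzy membership max{0, 1 - |d - r| / delta_d}."""
    return max(0.0, 1.0 - abs(d - r) / delta_d)


def shift_interval(m: int, scf: float, delta_d: float = 1.0) -> float:
    """Dynamic shift interval, assuming the form (2m - 1) / m * delta_d * SCF."""
    return (2 * m - 1) / m * delta_d * scf


def possibility_with_scf(d: float, r: float, delta_d_i: float) -> float:
    """Possibility 1 - |d - r| / delta_d_i computed with the dynamic shift interval."""
    return 1.0 - abs(d - r) / delta_d_i


s1, s_b1 = 0.510, 1.000  # measured and reference strength of the peak
fuzzy = membership_af(s1, s_b1, 0.300)
delta_s1 = shift_interval(m=3, scf=1.0, delta_d=0.300)
with_scf = possibility_with_scf(s1, s_b1, delta_s1)

print(round(fuzzy, 3), round(delta_s1, 3), round(with_scf, 3))
```

The widened interval is what turns an outright rejection into a small but nonzero possibility; the reference value itself is unchanged.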

As mentioned before, $f_1$ and $w_1$ are both the same as the reference values, so $f_1 @ p_{b1} = 1$, 
and $w_1 @ p_{b1} = 1$. 

Suppose the priorities of $f_1$, $s_1$ and $w_1$ are 2, 1 and 1 respectively; then the possibility 
of $p_1$ being identified as $p_{b1}$ is: 

$\mu_1 = p_{b1} @ p_1 = \frac{2 \times 1 + 0.02 + 1}{2 + 1 + 1} = 0.755.$ 
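
The combination step can be sketched as a priority-weighted average. Normalising by the sum of the priorities (2 + 1 + 1 = 4) is an assumption on our part; it is the reading that reproduces the value 0.755. The function name `combine` is hypothetical.

```python
def combine(possibilities, priorities):
    """Priority-weighted average of the per-feature possibilities."""
    total = sum(w * p for w, p in zip(priorities, possibilities))
    return total / sum(priorities)


# Possibilities of f_1, s_1 and w_1 for peak p_1, with priorities 2, 1 and 1.
mu_1 = combine([1.0, 0.02, 1.0], [2, 1, 1])
print(round(mu_1, 3))
```

Because frequency carries twice the weight of the other two features, an exact frequency match keeps the overall possibility high even though the strength contributes almost nothing.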

5.4.2 Case II: Considering the Second Kind of Related Data 

The process of considering the second kind of related data is quite similar. 

We have obtained the possibility $\mu_1$ of $p_1$ being created by a benzene-ring ($\mu_1 = 0.755$). 
Suppose the benzene-ring can create $m$ peaks: $\{p_{b1}, p_{b2}, \ldots, p_{bm}\}$; then the $m$ peaks are 
related to each other. If $p_1$ is created by the benzene-ring, then $Sp$ is partially created 
by the benzene-ring, i.e., the benzene-ring is contained in the unknown spectrum; if $Sp$ 
is partially created by the benzene-ring, then the other $m - 1$ peaks of the benzene-ring 
should also be identified. 

By using the same procedure as for obtaining $\mu_1$, we can get $\mu_2$, $\mu_3$, ... and $\mu_m$ as well. 
According to our method, the qualitative correlation between two related peaks, $p_i$ and $p_j$, 
is defined as: 



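$c_i(p_j) = \begin{cases} 1 & \text{if } \mu_j > 0.5 \\ 0 & \text{if } \mu_j < 0.5. \end{cases}$ 

So 

$SCF_i = \frac{\sum_{j \neq i} c_i(p_j) + 1}{m}, \quad 0 < SCF_i \leq 1.$ 

Let $\Delta d = 1$, then 

$\Delta d_i = \frac{2m - 1}{m} \times SCF_i, \quad 0 < \Delta d_i < 2,$ 

and 

$p_i @ benzene\text{-}ring = 1 - \frac{1 - \mu_i}{\Delta d_i}, \quad p_i @ benzene\text{-}ring \leq 1.$ 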
Roughly, when $SCF_i > 0.5$, related peaks tend to support $p_i$. When related peaks 
support $p_i$, $\Delta d_i > 1$. When $\Delta d_i > 1$, $p_i @ benzene\text{-}ring > \mu_i$. 
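
A minimal Python sketch of the Case II update follows. It takes $SCF_i$ as an input rather than recomputing it, treats a possibility of exactly 0.5 as non-supporting (the text leaves that boundary open), and uses hypothetical function names throughout.

```python
def correlation(mu_j: float) -> int:
    """c_i(p_j): 1 if related peak p_j has possibility above 0.5, else 0.
    Mapping mu_j == 0.5 to 0 is an assumption."""
    return 1 if mu_j > 0.5 else 0


def shift_interval(m: int, scf: float, delta_d: float = 1.0) -> float:
    """Dynamic shift interval (2m - 1) / m * delta_d * SCF_i."""
    return (2 * m - 1) / m * delta_d * scf


def updated_possibility(mu_i: float, delta_d_i: float) -> float:
    """p_i @ PC = 1 - (1 - mu_i) / delta_d_i."""
    return 1.0 - (1.0 - mu_i) / delta_d_i


mu = 0.6
for delta in (0.8, 1.0, 1.25):
    # An interval above 1 raises the possibility, one below 1 lowers it.
    print(delta, round(updated_possibility(mu, delta), 3))
```

The loop illustrates the three regimes described above: related peaks that disconfirm, are neutral towards, or support the identification of $p_i$.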

Table 2 shows the relation among $p_i @ benzene\text{-}ring$, $\mu_i$ and $\Delta d_i$. 



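Entries are $p_i @ benzene\text{-}ring$; rows are $\Delta d_i$, columns are $\mu_i$. 

$\Delta d_i$ \ $\mu_i$ | 1 | 0.8   | 0.5   | 0.3   | 0 
1.3                    | 1 | 0.846 | 0.615 | 0.462 | 0.231 
1.1                    | 1 | 0.818 | 0.545 | 0.364 | 0.091 
1                      | 1 | 0.8   | 0.5   | 0.3   | 0 
0.9                    | 1 | 0.778 | 0.444 | 0.222 | -0.111 
0.7                    | 1 | 0.714 | 0.286 | 0     | -0.429 

Table 2: Relation among $p_i @ benzene\text{-}ring$, $\mu_i$ and $\Delta d_i$ 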
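
The entries of Table 2 follow directly from $p_i @ benzene\text{-}ring = 1 - (1 - \mu_i)/\Delta d_i$. The Python sketch below recomputes them; the names are ours, and the $\mu_i = 0$ column is inferred from the values that appear in the table.

```python
def updated_possibility(mu_i: float, delta_d_i: float) -> float:
    """p_i @ benzene-ring = 1 - (1 - mu_i) / delta_d_i."""
    return 1.0 - (1.0 - mu_i) / delta_d_i


mus = [1.0, 0.8, 0.5, 0.3, 0.0]       # column headings (mu_i)
deltas = [1.3, 1.1, 1.0, 0.9, 0.7]    # row headings (delta d_i)

table = {d: [round(updated_possibility(mu, d), 3) for mu in mus] for d in deltas}
for d, row in table.items():
    print(d, row)
```

Negative entries arise because, unlike the conventional membership function, this formula is not clipped at zero.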



In the above example, $SCF_1 = 0.850$, and $\Delta d_1 = 1.658$, so 

$p_1 @ benzene\text{-}ring = 1 - \frac{1 - 0.755}{1.658} = 0.852.$ 

Therefore, the possibility of $p_1$ being identified as $p_{b1}$ increases from 0.755 to 0.852 due 
to qualitative correlations among related peaks. The process is similar to probability 
propagation in probabilistic reasoning. Here identifying $p_1$ is a hypothesis, and qualitative 
correlations among related data of $p_1$ are pieces of evidence. 

After all the peaks of the benzene-ring are identified, the possibility that the benzene-ring 
is contained in $Sp$ can finally be calculated by employing the same method as described 
in Section 5.4.1. 



5.5 Analysis of Experimental Results 

We compare two methods in the experiments. The first method (called "AF") is a 
conventional fuzzy method which is used by most similar systems (Clerc, Pretsch, & Zurcher, 1986; 
Wythoff, Buck, & Tomellini, 1989). To use AF, each reference value must be associated 
with a fuzzy interval for dealing with inaccuracy. Both reference values and fuzzy intervals 
are empirically determined (Colthup, Daly, & Wiberley, 1990). 

Table 3 lists some reference values and their fuzzy intervals used by AF. 
Partial component | Frequency         | Strength     | Width 
CH3               | 2960 ± 15 cm^-1   | strong ± 0.3 | sharp ± 1 
                  | 2870 ± 15 cm^-1   | strong ± 0.3 | sharp ± 1 
                  | 1450 ± 10 cm^-1   | medium ± 0.3 | sharp ± 0.5 
benzene-ring      | 3055 ± 25 cm^-1   | strong ± 0.3 | sharp ± 1.5 
                  | 1645 ± 10 cm^-1   | medium ± 0.3 | sharp ± 0.5 
                  | 1550 ± 30 cm^-1   | medium ± 0.3 | sharp ± 1 
                  | 1450 ± 3 cm^-1    | medium ± 0.3 | sharp ± 
-CH2-OH           | 3635 ± 5 cm^-1    | strong ± 0.3 | broad ± 1 
                  | 3550 ± 25 cm^-1   | strong ± 0.3 | sharp ± 1 

Table 3: Some reference values and their fuzzy intervals 

The membership function of AF is: 

$\mu_r(d) = \max\{0, 1 - \frac{|d - r|}{\Delta d}\},$ 

where $d$ is a measured data item, $r$ is a reference value, $\Delta d$ is the fuzzy interval of $r$, and 
$0 \leq \mu_r(d) \leq 1$. 
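
For concreteness, the conventional membership function can be written in a few lines of Python. The function name is hypothetical; the sample values use the strong-peak reference (1.000 ± 0.300) and the measured strength 0.510 from Section 5.4, plus one made-up in-range value.

```python
def membership(d: float, r: float, delta_d: float) -> float:
    """mu_r(d) = max{0, 1 - |d - r| / delta_d}; the result always lies in [0, 1]."""
    return max(0.0, 1.0 - abs(d - r) / delta_d)


exact = membership(1.000, 1.000, 0.300)    # measured value equals the reference
inside = membership(0.850, 1.000, 0.300)   # hypothetical value inside the interval
outside = membership(0.510, 1.000, 0.300)  # outside the interval, clipped to 0

print(exact, round(inside, 3), outside)
```

Anything further than Δd from the reference value receives a membership of exactly zero, which is why a badly shifted peak cannot be recovered by this method alone.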

The second method (called "AF*") is the proposed method. AF* uses the same 
reference values and fuzzy intervals as AF, but the fuzzy intervals in AF* are only used as 
standard fuzzy intervals based on which dynamic shift intervals are determined by 
considering qualitative correlations among related data. 

AF and AF* use the same reference values and empirical fuzzy intervals. The formula 
for calculating membership degrees in AF (i.e., $\mu_r(d) = \max\{0, 1 - \frac{|d - r|}{\Delta d}\}$) is also similar to 
the formula for calculating possibility in AF* (i.e., $\mu_i = 1 - \frac{|d_i - r_j|}{\Delta d_i}$). However, in AF, $\Delta d$ 
is simply an empirical fuzzy interval, while in AF*, $\Delta d_i$ is a dynamic shift interval based 
on qualitative correlations among related data. 

We have tested the system against several hundred real infrared spectra of organic 
compounds. The experimental results show that AF* is significantly better than AF. 

Table 4 lists part of the experimental results in which the first column indicates the 
solutions obtained by AF; the second column indicates the solutions obtained by AF*; and 
the third column shows the correct solutions. 
[Table 4 layout: three columns, AF (Without SCF), AF* (With SCF), and Correct Solutions, each listing the partial component set obtained for a test spectrum (for example, -CH2-, CH3-, -[CH2]n-). Each AF and AF* entry carries one of two marks: the first means that the identified PC set is the same as the PC set in the correct solution (in this case, RI = 1); the second means that the identified PC set is not the same as the PC set in the correct solution (the number indicates the RI).] 

Table 4: Experimental results with AF and AF* 

There are two important standard metrics for evaluating solutions of infrared spectrum 
interpretation: 

Definition 5.1 Rate of correctness (RC): the rate that the identified partial component set 
is exactly the same as the partial component set in the correct solutions. 
Definition 5.2 Rate of identification (RI): the rate that the partial components in the 
correct solutions are identified. 
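
The two metrics can be sketched in Python as follows. The solution sets below are hypothetical, and pooling RI over all spectra (rather than averaging a per-spectrum RI) is an assumption; the definitions above do not say which aggregation is used.

```python
from typing import List, Set


def rate_of_correctness(identified: List[Set[str]], correct: List[Set[str]]) -> float:
    """RC: share of spectra whose identified PC set equals the correct PC set."""
    exact = sum(1 for got, want in zip(identified, correct) if got == want)
    return exact / len(correct)


def rate_of_identification(identified: List[Set[str]], correct: List[Set[str]]) -> float:
    """RI: share of the correct partial components that were identified (pooled)."""
    found = sum(len(got & want) for got, want in zip(identified, correct))
    return found / sum(len(want) for want in correct)


# Hypothetical solutions for three spectra.
correct = [{"-CH2-", "CH3-"}, {"CH3-", ">C=CH-"}, {"-CH2-"}]
identified = [{"-CH2-", "CH3-"}, {">C=CH-"}, {"-CH2-"}]

rc = rate_of_correctness(identified, correct)
ri = rate_of_identification(identified, correct)
print(round(rc, 3), round(ri, 3))
```

A single missed component leaves RI fairly high but makes the whole spectrum count as incorrect for RC, which is why RC is the stricter of the two metrics.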

Table 5 shows the comparison between AF and AF* with the two standard metrics. 





    | RC (error-rate) | RI (error-rate) 
AF  | 0.455 (0.545)   | 0.812 (0.188) 
AF* | 0.736 (0.264)   | 0.894 (0.106) 

Table 5: Evaluation of AF & AF* with RC and RI 



Table 5 demonstrates that both the RC and RI increase by integrating SCF, but the 
RC increases more significantly. The reason is that although AF can identify most partial 
components of unknown compounds, the rate that it can identify all partial components 
of unknown compounds is low because there are always some partial components whose 
measured peaks seriously shift from the reference values. 

5.6 Comparison with Related Systems 

Related systems mainly fall into the following four categories: (1) systems based on Yes/No 
classification, (2) systems based on fuzzy logic, (3) systems based on pattern recognition, 
and (4) systems based on neural networks. 

5.6.1 Systems Based on Yes/No Classification 

The method commonly used by spectroscopists in practice is numerical analysis (Colthup, 
Daly, & Wiberley, 1990). Numerical analysis is primarily based on comparison between 
spectral data and reference values. Reference values are usually regions such as frequency: 
$3615 \pm 5$ cm$^{-1}$ or strength: $1.000 \pm 0.300$. If spectral data are in certain regions, the answer 
of classification is yes; otherwise, the answer is no. 

Most systems for interpreting infrared spectra use this method (Hasenoehrl, Perkins, & 
Griffiths, 1992; Puskar, Levine, & Lowry, 1986; Wythoff, Buck, & Tomellini, 1989). For 
example, in Wythoff's system, rules for comparing spectral data take the following form: 

ANY PEAK(S) FREQUENCY: 1100-1101 STRENGTH: 0.1-1.0 
WIDTH: SHARP TO BROAD 
ANSWER - YES - 
ACTION - *** 
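
The Yes/No comparison described above amounts to a pair of interval tests. The Python sketch below illustrates the idea only; it is not the rule engine of any of the cited systems, and the names `Region` and `classify` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A reference region written as centre +/- tolerance."""
    centre: float
    tolerance: float

    def contains(self, value: float) -> bool:
        return abs(value - self.centre) <= self.tolerance


def classify(frequency: float, strength: float,
             freq_region: Region, strength_region: Region) -> bool:
    """Answer yes only if every feature lies inside its reference region."""
    return freq_region.contains(frequency) and strength_region.contains(strength)


freq_region = Region(3615.0, 5.0)       # frequency: 3615 +/- 5 cm^-1
strength_region = Region(1.000, 0.300)  # strength: 1.000 +/- 0.300

print(classify(3617.0, 0.90, freq_region, strength_region))
print(classify(3617.0, 0.51, freq_region, strength_region))
```

The second call shows the weakness discussed in this section: a peak at the right frequency is rejected outright as soon as one feature falls outside its region.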



The advantage of these systems is that they are very easy to develop because they 
can directly use spectroscopic knowledge and do not need further computation. However, 
the problem is that each of these systems is only applicable to a class of compounds, or to 
pure compounds, because in the case of seriously inaccurate spectral data, the reference 
values (regions) cannot reflect the inaccuracy. For example, Hasenoehrl's system is only 
for distinguishing compounds containing at least one carbonyl functionality from other 
compounds, although the RI of the system is about 98% (naturally, the RC is not available), 
and Puskar's system is only for identifying hazardous substances. 

In fact, spectroscopists also use qualitative analysis in some specific cases in addition to 
the formal spectroscopic knowledge, such as "if the peaks in 600 cm$^{-1}$ - 900 cm$^{-1}$ look like 
the peaks of benzene-rings, then the peaks in 3000 cm$^{-1}$ - 3100 cm$^{-1}$ are quite likely to be 
created by a benzene-ring". Unfortunately, such qualitative analysis has hardly been applied in 
these systems since it cannot be used in the usual ways. In contrast, our system can successfully 
use qualitative analysis as spectroscopists do, by means of the method proposed 
in this paper. As a result, our system is applicable to all compounds and exhibits high 
performance with respect to correctness. 

5.6.2 Systems Based on Fuzzy Logic 

Since spectral data are always inaccurate, and the representation of spectroscopic knowledge 
is quite like that in fuzzy logic, some systems naturally use fuzzy logic or some techniques 
similar to fuzzy logic (Clerc, Pretsch, & Zurcher, 1986). In these systems, fuzzy intervals 
which are similar to the regions described in Section 5.6.1 are given for reference values, 
and memberships of inaccurate data are calculated on the basis of the degrees that the 
inaccurate data are in the fuzzy intervals. These systems are better than those described in 
Section 5.6.1 in some cases, but the degrees that inaccurate data are in fuzzy intervals do 
not necessarily reflect the possibility of the inaccurate data being the reference values. For 
example, in Figure 7, it is difficult to determine which peak is closer to the reference value 
only by considering the degrees that peak a and peak b are in the fuzzy interval. 



However, by applying the method proposed in this paper, the above problem can be 
easily solved. As we discussed in Section 5.6.1, in practice spectroscopists also frequently 
use knowledge about correlations among peaks in addition to the formalizable spectroscopic 
knowledge. This kind of knowledge is essential to our method which enables us to use 
qualitative correlations among related data as evidence for the identification of inaccurate 
data. 

We have compared the fuzzy method used by these systems with our method in Section 
5.5. So far as we know, the RC of our system is the highest among the similar systems, 
and the RI of our system is higher than that of most of the systems. 

5.6.3 Systems Based on Pattern Recognition 

Some systems use pattern recognition techniques to interpret infrared spectra (Jalsovszky & 
Holly, 1988; Sadtler, 1988), of which SADTLER is the most popular commercial system. The 
system compares known patterns with unknown ones, and determines the possibility of an 
unknown pattern being a known one by calculating the quantitative similarity or closeness 
between the two patterns. 

Figure 7: Two peaks in a fuzzy interval 

Unlike fuzzy techniques, pattern recognition considers a group of data (i.e., a pattern) 
at the same time. However, pattern recognition is primarily based on quantitative analysis. 
We have discussed that in many cases especially when the inaccuracy of spectral data is not 
slight, qualitative features of spectral data are much more important than quantitative ones. 
For example, Figure 8 shows two simple cases. The difference between the two patterns in 
(a) is smaller than that in (b). From the viewpoint of SADTLER, the two patterns in (a) 
are closer than those in (b). However, the two patterns in (b) may be the same in some 
cases, while the two patterns in (a) may not be the same in any case. The reason is that the 
qualitative features (frequency positions of peaks) of the two patterns in (a) are different. 



Because quantitative similarity and closeness are not always sound, most systems based 
on pattern recognition, including SADTLER, cannot give concrete solutions. In general, the 
solutions of these systems are only a series of candidates from which users have to finally 
decide the possible one by themselves. It is difficult to compare these systems with ours 
because the solutions of these systems are quite loose, and neither the RC nor the RI is 
available. SADTLER, for example, usually gives the list of all known patterns associated 
with the values of quantitative differences between the unknown patterns and these known 
ones. 

5.6.4 Systems Based on Neural Networks 

Recently, neural networks have been applied to infrared spectrum interpreting systems 
(Anand, Mehrotra, Mohan, & Ranka, 1991; Robb & Munk, 1990). In Anand's system, a 
neural network approach is used to analyze the presence of amino acids in protein molecules. 
To this specific classification, the RI of Anand's system is about 87%, and the RC is not 
available. In Robb's system, a linear neural network model is developed for interpreting 
infrared spectra. The system is for general purpose like our system. Without prior input of 
spectrum-structure correlations, the RC of Robb's system is equal to 53.3%. 

Although the RC and RI of our system are both higher than those of the two systems, 
we still think that using neural networks is very promising, especially when model training 
or system learning is a must. Research on applying neural networks to our 
system is left for the future. 




[Figure panels: (a) and (b)] 

Figure 8: Quantitative differences between patterns 

6. Related Work and Discussion 

Identifying inaccurate data has long been regarded as a significant and difficult problem in 
AI. Many methods and techniques have been proposed. 

Fuzzy logic provides the mathematical fundamentals of representation and calculation 
of inaccurate data (Bowen, Lai, & Bahler, 1992; Negoita & Ralescu, 1987; Zadeh, 1978). 
Our method is primarily based on fuzzy theory. But compared with conventional fuzzy 
techniques, the advantages of our method include: (1) fuzzy intervals of inaccurate data 
are dynamically determined so that dynamic information can be used; (2) fuzzy intervals 
are based on qualitative features of data and qualitative correlations among related data so 
that the solutions are more robust. The limitation of our method is that when qualitative 
correlations among related data are not known in advance, the method degenerates to a 
conventional fuzzy method. For instance, if SCF is unavailable, the two methods described 
in Section 5.5 become the same. 

Pattern recognition provides the techniques for interpreting measured data in group 
(Jalsovszky & Holly, 1988). By pattern recognition methods, related data and connections 
among data can be considered. However, there are two preconditions which must be satisfied 
for complex data analysis by pattern recognition to be successful. The first precondition 
is that we have to obtain adequate data bases from which we can derive the patterns we 
need to recognize, and the second precondition is that we have to demonstrate that there 
are suitable metrics of similarity between patterns. When patterns explicitly exist, and 
measured patterns are not seriously noisy (e.g., fingerprint recognition), pattern recognition 
methods are effective. However, if patterns are not explicit, or patterns change irregularly, 
which implies that there is no stable metric for determining the similarity between 
patterns (e.g., spectrum interpretation), our method is more practical and robust. 

In identifying inaccurate data, the roles of "$d_i @ R_j$" and "$R_j @ MD$" are quite similar 
to the role of subjective statements or prior probabilities in other systems (Duda, Hart, 
& Nilsson, 1976; Shortliffe & Buchanan, 1975). However, the essential difference is that 
our method dynamically calculates the values of "$d_i @ R_j$" and "$R_j @ MD$" from qualitative 
correlations among related data, so that it does not need many assumptions beforehand 
and can avoid inconsistency in knowledge and data bases. Our method can also handle 
possibility propagation in inference networks, as illustrated by the 
process of considering the second kind of related data in spectrum interpretation (see Section 
5.4.2). 

When statistical samples are sufficient, or subjective statements can be consistently ob- 
tained, probabilistic reasoning methods can be applied to inaccurate data identification. 
When statistical samples of inaccurate data are not enough and consistent subjective state- 
ments are not available, our method is very effective. 

Our ongoing research related to probabilistic reasoning is to consider the interaction 
among identified partial components. As we discussed before, spectroscopists frequently 
use the knowledge such as "if C%H% coexists with CH%, then the peaks of CH3 around 
2900 cm$^{-1}$ may shift", or "if -C-O-C- has been identified, then the strength of the peaks of 
CH3 may change". Therefore, it is possible to update the possibilities of identified partial 
components by considering the interaction among them. Using probabilistic reasoning to 
analyze the effects among identified partial components would not only help us identify 
inaccurate data, but also provide us with the reason why the data are inaccurate. This 
research and the related experiments will be the subject of a sequel paper. 

7. Conclusions 

In this paper, we have presented a new method for identifying inaccurate data on the 
basis of qualitative correlations among related data. We first introduced a new concept 
called support coefficient function (SCF). Then, we proposed an approach to determining 
dynamic shift intervals of inaccurate data based on SCF, and an approach to calculating 
possibility of identifying inaccurate data, respectively. We also presented an algorithm 
for using qualitative correlations among related data as confirmatory or disconfirmatory 
evidence for the identification of inaccurate data. We have developed a practical system 
for interpreting infrared spectra by applying the proposed method, and have fully tested 
the system against several hundred real spectra. The experimental results show that the 
proposed method is significantly better than the conventional methods used in many similar 
systems. In this paper we have also described the system and the experimental results. 

Briefly, our novel work includes: 

1. A method which assumes an inaccurate data item to be a certain reference value on 
the basis of qualitative correlations between the inaccurate data item and all of its 
related data. 

2. An algorithm which crystallizes the method. 

3. A practical system which uses the algorithm to interpret infrared spectra. 